Introduction

Below we'll be performing analysis on football players dataset provided to us. We'll be performing list of steps mentioned below in our analysis:

  1. Load Dataset
  2. Clean Dataset
  3. Perform Basic Stats
  4. Perform groupby functions
  5. Visualize Dataset Columns and their Relationships

We'll start by importing default libraries for our purpose which we'll be using for analysis below.

In [1]:
import pandas as pd ## For maintaining dataframes.
import numpy as np ## For arrays

import matplotlib.pyplot as plt ## Data Visualization Library.
import plotly.graph_objects as go ## Data Visualization Library.
import plotly.express as px ## Data Visualization Library.
import hvplot.pandas ## Data Visualization Library.
import holoviews as hv ## Data Visualization Library.
import seaborn as sns ## Data Visualization Library.

import warnings ## Supress unwanted warnings

warnings.filterwarnings('ignore') ## Supress unwanted warnings

## Below command plots matplotlib plots into notebook
%matplotlib inline 

1. Load Dataset (Create DataFrame)

We'll start by loading CSV file and printing it's shape for information purpose.

In [2]:
df = pd.read_csv('CompleteDataset.csv')
print('dataset Size : ', df.shape)## Print (rows, columns)
df.head() ## Prints first 5 rows
dataset Size :  (17981, 48)
Out[2]:
Unnamed: 0 Name Age Photo Nationality Flag Overall Potential Club Club Logo ... Reactions Short passing Shot power Sliding tackle Sprint speed Stamina Standing tackle Strength Vision Volleys
0 0 Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal https://cdn.sofifa.org/flags/38.png 94 94 Real Madrid CF https://cdn.sofifa.org/24/18/teams/243.png ... 96 83 94 23 91 92 31 80 85 88
1 1 L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina https://cdn.sofifa.org/flags/52.png 93 93 FC Barcelona https://cdn.sofifa.org/24/18/teams/241.png ... 95 88 85 26 87 73 28 59 90 85
2 2 Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil https://cdn.sofifa.org/flags/54.png 92 94 Paris Saint-Germain https://cdn.sofifa.org/24/18/teams/73.png ... 88 81 80 33 90 78 24 53 80 83
3 3 L. Suárez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay https://cdn.sofifa.org/flags/60.png 92 92 FC Barcelona https://cdn.sofifa.org/24/18/teams/241.png ... 93 83 87 38 77 89 45 80 84 88
4 4 M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany https://cdn.sofifa.org/flags/21.png 92 92 FC Bayern Munich https://cdn.sofifa.org/24/18/teams/21.png ... 85 55 25 11 61 44 10 83 70 11

5 rows × 48 columns

2. Data Cleaning (Cleaning Name, Club, Wage, Value, etc)

Below We are cleaning data of columns which has special characters. We'll be remvoing special characters from Name, Club,Wage and Value columns. We also remove rows which has any NA values.

In [3]:
df = df.dropna() ## Drop rows which has NA values
## Replace special chracts by using "ascii" encoding. It loops through all strings and ignore chracters for which it can't find ascii mappings.
## This way all special chracters will be handled.
df['Name'] = [name.encode().decode('ascii', 'ignore') if isinstance(name,str) else name for name in df.Name]
df['Club'] = [club.encode().decode('ascii', 'ignore') if isinstance(club, str) else club for club in df.Club]

## We'll replace special character euro from Wage column. Then we'll remove K and multiply values by 1000.
df['Wage'] = df['Wage'].str.replace('€', '')
df['Wage'] = df['Wage'].str.replace('K', '')
df['Wage'] = df['Wage'].astype(np.int32) * 1000

## We'll remove special chracter euro from Value column.
df['Value'] = df['Value'].str.replace('€', '')

## Below method takes as input value and if it has K then multiplies by 1000 and if it has M then multiplies by 1000000.
def convert_values(val):
    if 'K' in val:
        new_val = val.replace('K','')
        new_val = float(new_val) * 1000
        return new_val
    elif 'M' in val:
        new_val = val.replace('M','')
        new_val = float(new_val) * 1000000
        return new_val
    else:
        return float(val)

## We'll replace all Values with appropriate conversion
df['Value'] = np.array([convert_values(val)  for val in df['Value']], dtype=np.float32)

## Removing unwanted whitespaces from strings
df['Club'] = df['Club'].str.strip()
df['Nationality'] = df['Nationality'].str.strip()

df.head()
Out[3]:
Unnamed: 0 Name Age Photo Nationality Flag Overall Potential Club Club Logo ... Reactions Short passing Shot power Sliding tackle Sprint speed Stamina Standing tackle Strength Vision Volleys
0 0 Cristiano Ronaldo 32 https://cdn.sofifa.org/48/18/players/20801.png Portugal https://cdn.sofifa.org/flags/38.png 94 94 Real Madrid CF https://cdn.sofifa.org/24/18/teams/243.png ... 96 83 94 23 91 92 31 80 85 88
1 1 L. Messi 30 https://cdn.sofifa.org/48/18/players/158023.png Argentina https://cdn.sofifa.org/flags/52.png 93 93 FC Barcelona https://cdn.sofifa.org/24/18/teams/241.png ... 95 88 85 26 87 73 28 59 90 85
2 2 Neymar 25 https://cdn.sofifa.org/48/18/players/190871.png Brazil https://cdn.sofifa.org/flags/54.png 92 94 Paris Saint-Germain https://cdn.sofifa.org/24/18/teams/73.png ... 88 81 80 33 90 78 24 53 80 83
3 3 L. Surez 30 https://cdn.sofifa.org/48/18/players/176580.png Uruguay https://cdn.sofifa.org/flags/60.png 92 92 FC Barcelona https://cdn.sofifa.org/24/18/teams/241.png ... 93 83 87 38 77 89 45 80 84 88
4 4 M. Neuer 31 https://cdn.sofifa.org/48/18/players/167495.png Germany https://cdn.sofifa.org/flags/21.png 92 92 FC Bayern Munich https://cdn.sofifa.org/24/18/teams/21.png ... 85 55 25 11 61 44 10 83 70 11

5 rows × 48 columns

3. Remove Unwanted Columns From DataFrame

Below we are selecting columns which are fit for analysis purpose. We are not taking column Unnamed, Photo, Flag and club flag as it won't be much helpful for analysis.

In [4]:
## We have taken ouc columns names Unnamed, Photo, Flat, Club Flag and ID as hey seems unusable columns for analysis.

useful_columns = ['Name', 'Age', 'Nationality', 'Overall', 'Potential', 'Club', 'Value', 'Wage', 'Special',
                  'Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control', 'Composure', 'Crossing',
                  'Curve', 'Dribbling', 'Finishing', 'Free kick accuracy', 'GK diving', 'GK handling', 
                  'GK kicking', 'GK positioning', 'GK reflexes', 'Heading accuracy', 'Interceptions', 'Jumping', 'Long passing', 
                  'Long shots', 'Marking', 'Penalties', 'Positioning', 'Reactions', 'Short passing', 'Shot power', 
                  'Sliding tackle', 'Sprint speed', 'Stamina', 'Standing tackle', 'Strength', 'Vision', 'Volleys']

df = df[useful_columns]

df.head()
Out[4]:
Name Age Nationality Overall Potential Club Value Wage Special Acceleration ... Reactions Short passing Shot power Sliding tackle Sprint speed Stamina Standing tackle Strength Vision Volleys
0 Cristiano Ronaldo 32 Portugal 94 94 Real Madrid CF 95500000.0 565000 2228 89 ... 96 83 94 23 91 92 31 80 85 88
1 L. Messi 30 Argentina 93 93 FC Barcelona 105000000.0 565000 2154 92 ... 95 88 85 26 87 73 28 59 90 85
2 Neymar 25 Brazil 92 94 Paris Saint-Germain 123000000.0 280000 2100 94 ... 88 81 80 33 90 78 24 53 80 83
3 L. Surez 30 Uruguay 92 92 FC Barcelona 97000000.0 510000 2291 88 ... 93 83 87 38 77 89 45 80 84 88
4 M. Neuer 31 Germany 92 92 FC Bayern Munich 61000000.0 230000 1493 58 ... 85 55 25 11 61 44 10 83 70 11

5 rows × 43 columns

In [5]:
df.dtypes ## For checking datatype of each column. object data type refers to string.
Out[5]:
Name                   object
Age                     int64
Nationality            object
Overall                 int64
Potential               int64
Club                   object
Value                 float32
Wage                    int64
Special                 int64
Acceleration           object
Aggression             object
Agility                object
Balance                object
Ball control           object
Composure              object
Crossing               object
Curve                  object
Dribbling              object
Finishing              object
Free kick accuracy     object
GK diving              object
GK handling            object
GK kicking             object
GK positioning         object
GK reflexes            object
Heading accuracy       object
Interceptions          object
Jumping                object
Long passing           object
Long shots             object
Marking                object
Penalties              object
Positioning            object
Reactions              object
Short passing          object
Shot power             object
Sliding tackle         object
Sprint speed           object
Stamina                object
Standing tackle        object
Strength               object
Vision                 object
Volleys                object
dtype: object

As we can see majority of columns are of object type, we'll be only considering columns with datatype as integer or float. We'll be considering columns like Nationality and Club as well for analysis which are of string type.

4. Basic Stats (5 different stats)

We'll be considering columns Nationality, Club, Age, Overall, Potential, Value, Wage and Special for analysis.

Max Values for Age, Overall, Potential, Value, Wage and Special

In [6]:
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].max()
Out[6]:
Age                 47.0
Overall             94.0
Potential           94.0
Value        123000000.0
Wage            565000.0
Special           2291.0
dtype: float64

Min Values for Age, Overall, Potential, Value, Wage and Special

In [7]:
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].min()
Out[7]:
Age            16.0
Overall        46.0
Potential      46.0
Value           0.0
Wage         1000.0
Special       728.0
dtype: float64

Average Values for Age, Overall, Potential, Value, Wage and Special

In [8]:
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].mean()
Out[8]:
Age          2.510940e+01
Overall      6.623279e+01
Potential    7.121198e+01
Value        2.418750e+06
Wage         1.170845e+04
Special      1.594386e+03
dtype: float64

Median Values for Age, Overall, Potential, Value, Wage and Special

In [9]:
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].median()
Out[9]:
Age              25.0
Overall          66.0
Potential        71.0
Value        700000.0
Wage           4000.0
Special        1633.0
dtype: float64

Mode Values for Age, Overall, Potential, Value, Wage and Special

In [10]:
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].mode()
Out[10]:
Age Overall Potential Value Wage Special
0 25 66 70 1100000.0 1000 1755

Standard Deviation Values for Age, Overall, Potential, Value, Wage and Special

In [11]:
df[['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special']].std()
Out[11]:
Age          4.609863e+00
Overall      7.001454e+00
Potential    6.106766e+00
Value        5.383791e+06
Wage         2.320012e+04
Special      2.719311e+02
dtype: float64

Correlation between Age, Overall, Potential, Value, Wage and Special

Correlation tells us how each columns are related to each other and how much relation exist between them.

In [12]:
correlation = df.corr()
correlation
Out[12]:
Age Overall Potential Value Wage Special
Age 1.000000 0.460797 -0.217946 0.083475 0.155308 0.242059
Overall 0.460797 1.000000 0.683018 0.630999 0.598918 0.606545
Potential -0.217946 0.683018 1.000000 0.590996 0.521456 0.399892
Value 0.083475 0.630999 0.590996 1.000000 0.850061 0.379608
Wage 0.155308 0.598918 0.521456 0.850061 1.000000 0.370389
Special 0.242059 0.606545 0.399892 0.379608 0.370389 1.000000

Describing Few Basic Stats About Age, Overall, Potential, Value, Wage and Special

In [13]:
df.describe()
Out[13]:
Age Overall Potential Value Wage Special
count 17733.000000 17733.000000 17733.000000 17733.0 17733.000000 17733.000000
mean 25.109401 66.232786 71.211978 2418782.5 11708.453166 1594.386003
std 4.609863 7.001454 6.106766 5383892.0 23200.122784 271.931089
min 16.000000 46.000000 46.000000 0.0 1000.000000 728.000000
25% 21.000000 62.000000 67.000000 325000.0 2000.000000 1449.000000
50% 25.000000 66.000000 71.000000 700000.0 4000.000000 1633.000000
75% 28.000000 71.000000 75.000000 2100000.0 12000.000000 1785.000000
max 47.000000 94.000000 94.000000 123000000.0 565000.000000 2291.000000

5. Sort Player By Club and Wage

In [14]:
df = df.sort_values(by =["Club", "Wage"])
df[['Club', 'Name', 'Wage']].head()
Out[14]:
Club Name Wage
15152 1. FC Heidenheim I. Hajtic 1000
16232 1. FC Heidenheim M. Kbbing 1000
13631 1. FC Heidenheim K. Lankford 2000
13392 1. FC Heidenheim D. Widemann 3000
13397 1. FC Heidenheim D. Gnaase 3000

6. Group By Countries

In [15]:
players_grouped_by_countries = df.sort_values("Nationality").set_index('Nationality')[['Name']]
players_grouped_by_countries.head(10)
Out[15]:
Name
Nationality
Afghanistan M. Kouhyar
Afghanistan N. Husin
Afghanistan S. Mukhammad
Albania S. Binakaj
Albania E. Hysaj
Albania S. Gashi
Albania B. Kukeli
Albania E. Lenjani
Albania O. Roshi
Albania G. Taipi
In [16]:
players_grouped_by_countries.tail(10)
Out[16]:
Name
Nationality
Zimbabwe M. Munetsi
Zimbabwe T. Kadewere
Zimbabwe K. Musona
Zimbabwe W. Katsande
Zimbabwe E. Chirambadare
Zimbabwe T. Hadebe
Zimbabwe T. Ndoro
Zimbabwe M. Nakamba
Zimbabwe C. Nhamoinesu
Zimbabwe K. Benyu

7. Average Potential of Players Grouped By their Age

In [17]:
average_players_potential_by_age = df.groupby(by='Age').mean()[['Potential']]
average_players_potential_by_age
Out[17]:
Potential
Age
16 76.538462
17 73.492248
18 72.654762
19 72.806754
20 72.719935
21 72.805947
22 72.555217
23 72.504734
24 72.113379
25 70.844459
26 70.570109
27 69.721239
28 69.319648
29 69.798365
30 69.878361
31 69.798771
32 69.470825
33 69.711697
34 68.868421
35 69.213904
36 68.938931
37 68.447761
38 68.470588
39 68.157895
40 67.857143
41 64.333333
43 55.000000
44 70.500000
47 46.000000

8. Data Visualizations

8.1 Bar Chart of Average Potential by Age (Holoviews Library)

Below we are plotting bar chart which shows average potential of players per age.

In [18]:
average_players_potential_by_age.hvplot(kind='bar', title='Average Potential Of Player By Age')
Out[18]:

List of Findings:

  • We can notice that age age increases, Potential decrease.
  • Potential is high till age of 25-30 and then decrease further.

8.2 Heatmap showing Correlation between Age, Overall, Potential, Value, Wage and Special (Matplotlib Library)

Below we are using heatmap to show correlation between age, overall, potential, value, wage and special which we calculated above. We are also showing colorbar which show intensity of relation. Correlation tells us how much particular column is related to another column.

In [19]:
with plt.style.context(('seaborn', 'ggplot')): ## It gives look of seaborn and ggplot to matplotlib.
    fig = plt.figure(figsize=(10,10)) ## Declaring figure
    plt.matshow(correlation, cmap=plt.cm.Blues, fignum=1) ## Matshow for plotting heatmapt with blues colormap.
    plt.xticks(range(len(correlation.index)), correlation.index) # Naming xticks
    plt.yticks(range(len(correlation.index)), correlation.index) # Naming yticks
    plt.title("Heat Map displaying correlation") 
    plt.colorbar() 
    plt.grid(None)
    
    ## Below code loops through all values of correlation and adds them in as red text in chart
    for i in range(len(correlation.index)): 
        for j in range(len(correlation.index)):
            plt.text(j,i, "%.2f"%correlation.values[i,j], 
                     horizontalalignment='center',
                     verticalalignment='center',
                     color='red',
                     fontsize=12)

List of Findings:

  • Potential has quite good relation with Overall(0.68), Value(0.59) and Wage(0.52).
  • Special has quite good relation with Overall(0.61).
  • Overall has good relation with Wage(0.60) and Special(0.61)

8.3 Bar Chart of Count of Player per Country (Holoviews Library)

Below we are plotting count of players per country by considering only countries which has more than 100 players to avoid graph from getting very crowded.

In [20]:
%%opts Bars [height=500 width=900 xrotation=90 title="Number Of Players Per Country(>100 Players)"]

## Groupby players per nationlity and count them and then take Name column and rename it to count
number_of_players_per_country = df.groupby(by="Nationality").count()[['Name']].rename(columns={'Name':'Count'})
## Sort players by Count
number_of_players_per_country = number_of_players_per_country.sort_values(by="Count")
## Consider only rows where number of players are more than 100
number_of_players_per_country = number_of_players_per_country[number_of_players_per_country.Count > 100]
## Plot bar chart showing count
number_of_players_per_country.hvplot(kind="bar", color="tomato")
Out[20]:

List of Findings:

  • Countries like England, Germany, Spain, France, Argentina, Brazil and Italy has quite good representation of more than 800 players.
  • Other countries has representation of somewhere between 100-800.

8.4 Bar Chart depicting Average Wage(Euro) Per Club (Holoviews Library)

Below we are plotting bar chart showing average wage per club using holviews library. We are only considering entries where average wage is above 50k euro otherwise chart will become crowded to understand if we include all clubs.

In [21]:
%%opts Bars [height=500 width=900 xrotation=90 title="Average Wage Per Club(>50,000 Euro)"]

## Group rows by club and take mean of Wage. Rename Wage column to Average_wage
avg_wage_by_club = df.groupby(by="Club").mean()[['Wage']].rename(columns={'Wage':'Average_Wage'})
## Sort rows by average_wage
avg_wage_by_club = avg_wage_by_club.sort_values(by="Average_Wage")
## Take only rows where average wage is greater than 50k.
avg_wage_by_club = avg_wage_by_club[avg_wage_by_club.Average_Wage > 50000]
avg_wage_by_club.hvplot(kind="bar", color="lawngreen")
Out[21]:

List Of Findings:

  • FC Barcelona and Real Madrid has highest average wages compared to other clubs of more than 180K.
  • Other clubs has average wage between 50K-125K.

8.5 Scatter Plot of Age vs Wage Color-Encoded by Potential (Seaborn Library)

Below we are plotting scatter plot showing relation between age and wage. We also have color-encoded markers according to potential.

In [22]:
plt.figure(figsize=(15,8))
sns.scatterplot(data=df, x="Age", y="Wage" , size="Potential", hue="Potential")
plt.title("Scatter plot explaining relationship between Age and Wage color-encoded by Potential");

List of Findings:

  • Players between age 25-35 seems to get more age compared to other age brackets.
  • There are few exceptions but majority of highest wage earners are between 25-35.
  • Top wage earners are between age of 30-35 as their potential also seems quite high of nearly 100.

8.6 Scatter Matrix Showing Distribution and Relation between Age, Overall, Potential, Value, Wage and Special (Matplotlib Library)

Below we are using pandas internal scatter matrix function which plots scatter matrix using matplotlib library. Scatter matrix is plotted for rows Age, Overall, Potential, Value, Wage and Special. All diagonal plots of scatter matrix are histograms showing distribution of that column and all non-diagonal charts are scatter plot showing relation between that 2 columns intersecting there.

In [23]:
pd.plotting.scatter_matrix(df, figsize=(18,18), diagonal='hist', color="tomato");

List of Findings:

  • Potential has linear relationship with Overall.
  • Value for most players is less than 25M (Value Histogram).
  • Wage for most players is less than 20K(Wage Histogram).

8.7 Bar Chart of Top 10 Average (Overall, Potential) per Club (Holviews Library)

Below we are plotting bar chart showing average (Overall,Potential) per club for top 10 clubs according to most value of Overall and Potential.

In [24]:
%%opts Bars [height=500 width=1100 xrotation=90 title="Top 10 Clubs By Average Potential & Overall"]

## Group rows by club and then take mean of Potential and Overall.
avg_potential_overall_by_club = df.groupby(by="Club").mean()[['Potential', 'Overall']]
## Sort values by first Potential and then Overall.
avg_potential_overall_by_club = avg_potential_overall_by_club.sort_values(by=["Potential","Overall"], ascending=False)
## Plot them as bar chart.
avg_potential_overall_by_club.head(10).hvplot(y=['Potential', 'Overall'], kind="bar")
Out[24]:

List of Findings:

  • Real Madrid has highest Potential compared to other clubs.
  • FC Barcelona has highest Overall.

8.8 Pie Chart of Countries With Most Players (Plotly Library)

Below we have used pie chart to display distribution of players from all countries. These countries has most players in dataset.

In [25]:
players_count_per_country = df.groupby("Nationality").count()[['Name']].rename(columns={"Name":"Count"})
players_count_per_country = players_count_per_country.sort_values(by="Count", ascending=False)#.head(10)

fig = go.Figure(data=[go.Pie(labels=players_count_per_country.index, values=players_count_per_country.Count)])
fig.update_layout(height=800,width=800, title="Countries with number of players")
fig.show()

List of Findings:

  • Countries like England, Germany, Spain, France, Argentina, Brazil and Italy has lot of representation with more than 800 players per country.
  • England has highest representation with 1630 players.